So, what is an LLM anyway?

A Backstage Tour of Architecture and Tuning

KHUN Kimang

Krorng AI

2026-02-26

About Me

To know more, check out my website here: https://kimang18.github.io/

Large Language Model

LLM is a “computer program”

There are 2 stages in creating an LLM

Pre-training: learn laws of human language and world facts

  • need an enormous text corpus
  • need a well-designed model architecture
  • need an enormous amount of computational resources

Think of it like “compressing the knowledge into your model”

Pre-training: learn laws of human language and world facts

For Llama 2 70B, it is like “compressing the internet”

A neural network that predicts the next word in the sequence

To predict the next word, the neural network must first compress the world

The base model “dreams” internet documents
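"Predicting the next word" can be sketched with a toy model. The corpus and probabilities below are invented purely for illustration; a real LLM predicts over tens of thousands of subword tokens with a deep neural network, but the autoregressive loop is the same.

```python
import random

# Toy "language model": for each context word, a distribution over next words.
# These probabilities are invented for illustration only.
toy_model = {
    "the": {"cat": 0.5, "internet": 0.3, "model": 0.2},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"down": 1.0},
}

def predict_next(word, greedy=True):
    """Return the next word: the most likely one (greedy) or a sample."""
    dist = toy_model[word]
    if greedy:
        return max(dist, key=dist.get)
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs)[0]

def generate(start, n_tokens):
    """Autoregressive generation: feed each prediction back as context."""
    out = [start]
    for _ in range(n_tokens):
        if out[-1] not in toy_model:
            break
        out.append(predict_next(out[-1]))
    return " ".join(out)

print(generate("the", 3))  # greedy decoding gives: "the cat sat down"
```

Sampling instead of taking the argmax (`greedy=False`) is what lets the base model "dream" varied continuations rather than repeat one fixed text.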

SFT: learn to interface with human

The instruct model is able to chat with you

It learns to answer questions, follow commands, make suggestions, etc.

The instruct model is able to chat with you

But it imitates a single ‘correct’ answer, or strictly follows the stated facts.

RLHF: learn human values and preferences

Summary: how to train your LLM

Pre-training

  1. Prepare large text corpus (~10TB of text for Llama2 70b)
  2. Get a cluster of GPUs (~6,000 GPUs for Llama2 70b)
  3. Compress the text into a neural network (pay ~$2M, wait ~12 days)
  4. Obtain base model

Finetuning

  1. Write labeling instructions
  2. Hire people, collect high quality ideal Q&A responses, and/or comparisons
  3. Finetune base model on this data (wait ~1 day)
  4. Obtain chat model
  5. Run a lot of evaluations
  6. Deploy.
  7. Monitor, collect misbehaviors, go to Step 1.

LLM scaling laws: predictable improvement with scale

Performance of LLMs is a smooth, well-behaved, predictable function of:

  • the number of parameters of the network (i.e. model size)
  • the amount of text we train on

and the trends do not show signs of ‘topping out’.

We can expect more intelligence “for free” by scaling.

We can expect a lot more “general capability” across all areas of knowledge
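The “smooth, predictable function” has been fitted empirically. A minimal sketch of the parametric loss from Hoffmann et al. (2022), using the constants fitted in that paper (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28); the example model/token counts are chosen for illustration:

```python
def chinchilla_loss(n_params, n_tokens):
    """Predicted pre-training loss L(N, D) = E + A/N^alpha + B/D^beta,
    with the constants fitted by Hoffmann et al. (2022)."""
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Scaling up both model size and training data lowers the predicted loss:
small = chinchilla_loss(7e9, 1e12)    # ~7B params, ~1T tokens
large = chinchilla_loss(70e9, 10e12)  # ~70B params, ~10T tokens
print(small, large)
```

Note the power-law form: loss falls smoothly as N and D grow, but never below the irreducible term E, which is one way to read the “diminishing returns” discussion that follows.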

About Now and The Future

LLMs are still not smart enough

The reversal curse: Being trained on “A is B”, it fails to learn “B is A”.

We are entering a period of diminishing returns

Base Model Seems to be an Upper Bound for Reasoning Capability (Yue et al. 2025)

We are entering a period of diminishing returns

“At some point, we reach a regime where we have, in fact, finished compressing human knowledge” (translated from French; sourced from this interview)

– Arthur Mensch, CEO, Mistral AI

From Generalist to Specialist

Customize LLMs to your task and context

LLMs are trained to understand and exploit the context.

There are 3 fine-tuning techniques

A few solutions for fine-tuning LLMs

  • Unsloth: fast, and now supports multi-GPU
  • Axolotl: fast, supports multi-GPU
  • TRL: fast, supports multi-GPU
  • MLX-LM: fast, supports multi-GPU (Apple Silicon only)

Demo: fine-tune LLama 3.1 8B

Use QLoRA technique and mlabonne/FineTome-100k dataset

Install packages

pip install unsloth vllm

Import dependencies

import torch
from unsloth import FastLanguageModel, is_bfloat16_supported
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template
from trl import SFTTrainer
from transformers import TrainingArguments, TextStreamer

Load model and tokenizer

Load a pre-quantized, 4-bit version of meta-llama/Llama-3.1-8B.

max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,
)

Apply the low-rank adaptation (LoRA) technique

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"], 
    use_rslora=True,
    use_gradient_checkpointing="unsloth"
)

Check the names of the model’s linear modules to choose target_modules

FineTome-100k

This dataset uses ShareGPT template.

Find more details here: https://huggingface.co/datasets/mlabonne/FineTome-100k.
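To see what the two formats look like, here is their shape side by side; the sample conversation is invented for illustration, but FineTome’s rows follow the same `from`/`value` structure:

```python
# A ShareGPT-style row stores messages under "from"/"value" keys
# (sample content invented for illustration):
sharegpt_row = {
    "conversations": [
        {"from": "human", "value": "What is an LLM?"},
        {"from": "gpt", "value": "A neural network that predicts the next token."},
    ]
}

# ChatML expects "role"/"content" keys with user/assistant roles:
role_map = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_to_chatml(row):
    """Rename the keys and roles of one ShareGPT row into ChatML form."""
    return [
        {"role": role_map[m["from"]], "content": m["value"]}
        for m in row["conversations"]
    ]

print(sharegpt_to_chatml(sharegpt_row))
```

This key-and-role renaming is exactly what the `mapping` argument of `get_chat_template` automates over the whole dataset.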

Chat template

We fine-tune with the ChatML template

# ChatML template
# messages = [
#     {"role": "system", "content": "..."},
#     {"role": "user", "content": "..."},
#     {"role": "assistant", "content": "..."},
# ]

=> Map the ShareGPT fields to the ChatML format

tokenizer = get_chat_template(
    tokenizer,
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
    chat_template="chatml",
)

def apply_template(examples):
    messages = examples["conversations"]
    text = [tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=False) for message in messages]
    return {"text": text}

dataset = load_dataset("mlabonne/FineTome-100k", split="train")
dataset = dataset.map(apply_template, batched=True)

The text is now ready to be tokenized

print(dataset["text"][0])
# This gives something similar to below
"""
<|im_start|>system
You are a helpful assistant, who always provide explanation. Think like you are answering to a five year old.<|im_end|>
<|im_start|>user
Remove the spaces from the following sentence: It prevents users to suspect that there are some hidden products installed on theirs device.<|im_end|>
<|im_start|>assistant
Itpreventsuserstosuspectthattherearesomehiddenproductsinstalledontheirsdevice.<|im_end|>
"""

Training Setup

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        learning_rate=3e-4,
        lr_scheduler_type="linear",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=2,
        num_train_epochs=1,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=10,
        output_dir="output",
        seed=0,
    ),
)

trainer.train()
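A quick sanity check on the hyperparameters above: the batch the optimizer actually sees is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs. A sketch, assuming a single-GPU run:

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
n_gpus = 1  # assumption: single-GPU run

# Gradients are accumulated over 2 micro-batches of 8 sequences before
# each optimizer step, so each update sees an effective batch of 16.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * n_gpus
print(effective_batch)  # 16
```

Accumulation trades wall-clock time for memory: it reaches the larger effective batch without ever holding more than 8 sequences on the GPU at once.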

Evaluation

Evaluate your model on test dataset or different benchmarks and tools like llm-autoeval.

# Simple inference
model = FastLanguageModel.for_inference(model)

messages = [
    {"from": "human", "value": "Is 9.11 larger than 9.9?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids=inputs, streamer=text_streamer, max_new_tokens=128, use_cache=True)

Save and upload model

Once you’re satisfied with your model, you can merge the adapter and share it

# save to local disk
model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
# upload to huggingface repo: mlabonne/FineLlama-3.1-8B
model.push_to_hub_merged("mlabonne/FineLlama-3.1-8B", tokenizer, save_method="merged_16bit")

For faster inference with llama.cpp (or LM Studio, Ollama, etc.), you can also upload GGUF-quantized versions

# available quantization methods
quant_methods = ["q2_k", "q3_k_m", "q4_k_m", "q5_k_m", "q6_k", "q8_0"]
# push quantized format to repo: mlabonne/FineLlama-3.1-8B-GGUF
for quant in quant_methods:
    model.push_to_hub_gguf("mlabonne/FineLlama-3.1-8B-GGUF", tokenizer, quant)

KrorngAI YT Channel

References

Berglund, Lukas, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. 2024. “The Reversal Curse: LLMs Trained on ‘A Is B’ Fail to Learn ‘B Is A’.” https://arxiv.org/abs/2309.12288.
Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, et al. 2022. “Training Compute-Optimal Large Language Models.” https://arxiv.org/abs/2203.15556.
Villalobos, Pablo, Jaime Sevilla, Lennart Heim, Tamay Besiroglu, Marius Hobbhahn, and Anson Ho. 2022. “Will We Run Out of Data? An Analysis of the Limits of Scaling Datasets in Machine Learning.” arXiv Preprint arXiv:2211.04325 1 (1).
Yue, Yang, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. 2025. “Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?” arXiv preprint arXiv:2504.13837.